Beyond “If then”-- Three Techniques for Cleaning Character Variables from Write-in Questions
Authors
Abstract
In survey studies, cleaning answers to write-in questions can be difficult and time-consuming, especially when the same response may be written in multiple ways. Misunderstanding of the survey question, unrecognizable handwriting, and negligence in data entry are major sources of data inaccuracy that are almost impossible to avoid. Writing a series of “if then” statements is the classic solution to cleaning such data. However, as datasets grow, the number of conditions to be tested grows too, until thousands of “if then” statements may be required. This paper presents three techniques that we used to clean up the country of birth questions in the Cancer Prevention Study-3 (CPS-3): combining data merging with Excel spreadsheets, using the LIKE and SOUND LIKE operators, and joining tables with compare functions in the SQL procedure. Together they not only reduced the workload and eased the stress of cleaning character variables but also added some variety to this tedious task.
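The general idea of replacing a long chain of “if then” statements with a lookup table plus approximate matching can be sketched as follows. This is a minimal illustration in Python, not the SAS code used in the paper: the `LOOKUP` table, the `clean_country` function, and the `cutoff` value are all hypothetical, and the standard-library `difflib` similarity score stands in for the compare functions mentioned in the abstract.

```python
import difflib

# Hypothetical lookup table: known write-in variants -> canonical country name.
# In practice such a table could be maintained in a spreadsheet and merged in.
LOOKUP = {
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
    "mexico": "Mexico",
    "germany": "Germany",
    "philippines": "Philippines",
}


def clean_country(raw, cutoff=0.8):
    """Return a canonical country name, or None when no confident match exists."""
    key = raw.strip().lower()

    # Step 1: exact match against the lookup table (the "merge" step).
    if key in LOOKUP:
        return LOOKUP[key]

    # Step 2: approximate match to catch misspellings (the "compare" step).
    # The cutoff of 0.8 is an assumed threshold and would need tuning on real data.
    matches = difflib.get_close_matches(key, list(LOOKUP), n=1, cutoff=cutoff)
    return LOOKUP[matches[0]] if matches else None
```

Responses that return `None` would still need manual review; the point of the sketch is only that new spellings are handled by adding rows to the table or by the similarity step, rather than by writing another condition.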
Similar resources
Autonomous cleaning of corrupted scanned documents - A generative modeling approach
We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink etc. We aim at autonomously removing dirt from a single letter-size page based only on the information the page contains. Our approach, therefore, has to learn character representations without supervision and requires a mechanism to distinguish learned representatio...
A Comparative Study of the Views of Kristjansson, Battaly and Martyr Motahhari on the Effect of Theoretical Wisdom and Modeling on Moral Character
“Moral character” is a subject related to the psyche and an important topic in moral psychology. The question is whether the formation of moral character is inherent and unchangeable. If it is acquired, what factors affect it? If a character becomes vicious, is it possible to correct it? Considering the importance of the aforementioned questions, it was necessary to study the issue of “shaping ...
The Most Cost Effective Gas Cleaning Device in Steel Industry with Industrial Ecology Approach
Industrial growth and environmental damage, two important indicators in sustainable development, both accompany the steel industry. This article aims to guide industries toward green industry. To this end, energy, material, and capital consumption and environmental damage, as environmental sustainability criteria, have been investigated for three different dust collectors to select the most environmentally sui...
Ahmad Jām, from Legend to Truth
Among the well-known Sufi figures, the life and character of Ahmad Jām are especially bound up with legends and strange anecdotes. Official authors have drawn his figure by recounting legends and wondrous anecdotes and attributing to him features such as breaking jars, dogmatism, enjoining good and forbidding wrong, and breaking harps, in a way that it was of his favorite and they did...
Autonomous Document Cleaning - A Generative Approach to Reconstruct Strongly Corrupted Scanned Texts
We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink, etc. We aim at autonomously removing such corruptions from a single letter-size page based only on the information the page contains. Our approach first learns character representations from document patches without supervision. For learning, we use a probabilistic ...